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Abstract: Two existing approaches to functional principal components 
analysis (FPCA) are due to Rice and Silverman (1991) and Silverman (1996), 
both based on maximizing variance but introducing penalization in differ- 
ent ways. In this article we propose an alternative approach to FPCA using 
penalized rank one approximation to the data matrix. Our contributions 
are four-fold: (1) by considering invariance under scale transformation of 
the measurements, the new formulation sheds light on how regularization 
should be performed for FPCA and suggests an efficient power algorithm for 
computation; (2) it naturally incorporates spline smoothing of discretized 
functional data; (3) the connection with smoothing splines also facilitates 
construction of cross-validation or generalized cross-validation criteria for 
smoothing parameter selection that allows efficient computation: (4) differ- 
ent smoothing parameters are permitted for different FPCs. The method- 
ology is illustrated with a real data example and a simulation. 
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1. Introduction 

Principal components analysis is a key technique for unsupervised functional 
data analysis Ramsay and Silverman (2005) where the goal is to decompose 
variation in a two-way data table. Two different approaches to smoothed func- 
tional principal components analysis (FPCA) were proposed by Rice and Sil- 
verman (1991) and Silverman (1996), both of which we briefly describe in what 
follows. Let X(-) denote a random function which we assume for now can be 
observed repeatedly and as a whole, without discretization; in addition, let a 
be a smoothing penalty parameter. To find the j-th principal component weight 
function 7j(-), the Ricc-Silverman approach maximizes 

var(/ 7 X) - a J 7' 
JV 

subject to the constraint J 77^. = for k < j, where 7^ is the estimated fc-th 
principal component function, while the Silverman approach maximizes 

var(J 7 X) 
J 7 2 + a J 7" 2 

subject to J 77^ + a J ^"^'l = for k < j. Both approaches impose smoothness 
on principal components using roughness penalty ideas, but they differ in the 
way they incorporate the penalty. The variance var(J 7 A) in (1) and (2) needs 
to be estimated from a random sample of realizations of X(-). 

Following Hotclling (1933), maximizing variance of a standardized linear 
combination of variables is the standard textbook treatment of principal com- 
ponents analysis (PCA, e.g., Mardia et al., 1979). A different approach is by 



(1) 



(2) 
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way of fitting low rank approximations to the data matrix (Pearson, 1901), 
and this approach is intimately connected to the singular value decomposition 
(Eckart and Young, 1936). See Jolliffc (2002) for further discussion and refer- 
ences. In this article we apply the roughness penalty idea to a low rank approach 
to FPCA. As we will see, there exist difficulties in introducing roughness penal- 
ties, and a guiding principle in navigating these difficulties will be certain invari- 
ance under scale transformations. Scale invariance considerations will not only 
allow us to select the proper form of penalty, but also to re-derive and thereby 
justify the regularized variance criterion of Silverman (1996). However, penalized 
low rank approximation combined with invariance principles amounts to more 
than a novel justification for an existing approach; it has several methodological 
advantages over existing regularized variance approaches. 

First, our approach yields a power algorithm that is an efficient variant of 
the power algorithm (e.g., Appendix A of Jolliffc, 2002) for calculating eigen- 
vectors. Second, spline smoothing of discretized data is naturally built into our 
method which therefore gains the theoretical advantages of smoothing splines: 
our method applies directly to discretized data, and the estimated FPCs are 
solutions of an optimization problem defined on a function space. More im- 
portantly, the connection of our method to spline smoothing helps us develop 
computationally efficient cross-validation (CV) and generalized cross-validation 
(GCV) criteria for selecting smoothing parameters. This development fills a gap 
in the FPCA literature as is shown by a comparison: both Rice and Silverman 
(1991) and Silverman (1996) assume the whole curve is available through ei- 
ther interpolation or smoothing, while (Ramsay and Silverman, 2005, Chapters 
8 & 9) represents functional data using a basis expansion prior to performing 
PCA. Finally, our method allows different principal components to have differ- 
ent smoothing parameters, which is a beneficial flexibility that is not shared by 
the method of Silverman (1996). 

In this paper we focus on functional data that are sampled on a common 
grid across subjects — a typical setting of functional data analysis. Sometimes, 
the functions may be irregularly or even sparsely sampled, as often occurs in 
biomedical longitudinal studies. FPCA methods for sparsely sampled functional 
data or longitudinal data have been developed, among others, by James (2000), 
Rice (2001), and Yao et al. (2005a,b). 

The rest of the article is organized as follows. Section 2 presents the new 
method for extracting principal components for finite dimensional vector spaces; 
Section 5 extends the method to function space. Section 3 describes the power 
algorithm. Section 4 discusses the criteria for smoothing parameter selection and 
also provides the derivations. Section 6 gives some numerical results. Finally, the 
Appendix contains some technical details. 

2. A new framework on functional principal components 

Since functional data are usually observed discretely, this section presents our 
method in terms of discretized data. Unlike standard multivariate analysis, we 
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will take into account the intrinsic functional structure of the data with gen- 
eral Ridge penalties, which in the functional case will be smoothness penalties. 
Recovery of the whole curve of principal component weight functions involves 
spline interpolation and will be discussed in Section 5. In what follows we fo- 
cus on the first principal component; subsequent principal components can be 
extracted sequentially by removing preceding components. 

2.1. Minimizing reconstruction error 

Consider a collection or sample of functional data, observed or recorded at a 
common set of discrete observation points ti,...,t m . Denote the underlying 
functions for the sample as Xi(-), i = l,...,n. The observed data are Xij = 
Xi(tj), i = 1, . . . , 7i, j = 1, . . . , m. Let the n x m data matrix be X = (xtj). For 
simplicity and without loss of generality, suppose that X is column centered, 
that is, the sample mean of each column of X is zero. 

Consider the problem of finding the best rank one approximation of X. Any 
rank one matrix of size n x m can be written as uv T , where u = (ui, . . . , u n ) T 
and v ~ (tji, . . . , v m ) T . Let ||X||i? denote the Frobenius norm of X. The problem 
can be formally stated as minimizing with respect to u and v the following 
reconstruction error, 

n m 

||X - uv r ||| — tr{(X - uv r )(X - uv r f} = £ - u iVj f. (3) 

i=i j=i 

For a fixed v, the u that minimizes (3) is u = Xv/v T v. Plugging this u into (3) 
we find that the minimizing v of (3) maximizes v T X T Xv/v T v. Since X T X 
is proportional to the sample variance-covariance matrix, v T X T Xv/v T v is 
proportional to the sample variance of the data projected onto the direction 
of v. Thus v is, up to a scale factor, the first principal component loading 
(weight) vector for X according to standard multivariate analysis textbooks 
(e.g., Mardia ct al., 1979). We also note that the minimizing u and v of (3) are 
the first singular vector pair of X, again up to a scale factor. 

The appeal of minimizing (3) is that it resembles a prediction problem. If u 
were observable, then (3) would be the least squares criterion for a multivariate 
regression. The difference is that the predictor variable u needs to be estimated 
here. Based on the above observation, we next discuss how to modify (3) to 
take into account the functional nature of data by smoothing with a roughness 
penalty. The connection with a prediction problem will be critical once again 
for deriving cross-validation criteria in Section 4. 

2.2. Penalization and transformation invariance 

Following the basic philosophy of functional data analysis, we assume that there 
is an underlying smooth function "/(■) such that Vj = 7(tj). If the domain of val- 
ues t is the real line, we can assume ti sorted. Because of the smoothness of 7(-), 
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a pair of adjacent values vj and Wj+i are necessarily tied to each other by cor- 
relation. Using the roughness penalty idea, it is natural to consider minimizing 

||X-uv T |||, + av T Ov, (4) 

where £1 is a non-negative definite roughness penalty matrix, and a > is a 
penalty parameter. The penalty matrix is chosen such that a larger value 
of v T Ov is associated with a greater penalty on differences between adjacent 
values. For example, an intuitive choice of £1 for equispaced tj based on second 
differences of v would be v T f2v = YlT=2 — + v j-i) 2 - Construction of 

general penalty matrices that are suitable for both smoothing and interpolation 
will be discussed in Section 5 (see Theorem 1). 

An immediate problem with (4) is that it is not invariant under the pair of 
scale transformations u — > cu and v — > v/c. While the goodness-of-fit term 
|X — uv T ||^ does not change, the roughness penalty term v T f2v changes. A 
simple possible fix is to make the penalty term unitless and to minimize 

||X-uvl|+a^. (5) 

Fixing v, the minimizing u is u = Xv/v T v, which can be plugged back into 
(5) to show that minimizing (5) is equivalent to maximizing 

v T X T Xv — a v T flv , . 

T • 6 

This maximizing criterion is essentially the same as the Rice-Silverman criterion 
expressed in (1). 

However, the criterion (5) has a defect: The optimization problem is not in- 
variant under scale transformation of the measurements. Under the transforma- 
tion X — > cX and the corresponding transformation u — > cu, the goodness-of-fit 
term becomes c 2 ||X — uv T |||,, while the penalty term remains unchanged. The 
reason is that the goodness-of-fit term is not unitless but the penalty term is. 
The criterion (4) has the same defect because of a mismatch of units of the two 
terms involved. — This motivates us to consider minimizing 

||X-uv T |||, + au T uv T Ov, (7) 

where the two terms now have the same units. Optimization of (7) is invariant 
under scale transformation in the following sense: If u and v form its minimizcr 
for the data matrix X, then u* = cu and v* = v form its minimizer for the 
rescaled data matrix X* = cX. In other words, the minimizing v of this criterion 
is the same before and after the scale transformation X — > cX. — Fixing v, the 
minimizing u is u = Xv/v T (I m + af2)v, which can be plugged back into (7) 
to show that minimizing (7) is equivalent to maximizing 



v T X T Xv 
v T v + a v T flv 



(8) 
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This maximizing criterion is essentially the same as the Silverman criterion 
expressed in (2). 

Use of the penalized reconstruction error combined with invariance consider- 
ations indicates that the Silverman proposal is preferable to the Rice-Silverman 
proposal for functional principal components analysis. Such insight would not be 
forthcoming by studying penalized variance alone. In addition, as we shall show 
below, a direct methodological advantage of using the penalized reconstruction 
error (7) is that it facilitates extension of CV- and GCV- type smoothing pa- 
rameter selection criteria for spline smoothing to FPCA. 

From now on we shall focus on the criterion (7) for extracting smoothed 
principal components. An extension of (7) for extracting the whole principal 
component weight function is given in Section 5. Since we extract the principal 
component functions sequentially, multiple smoothing parameters are allowed. 
This is different from Silverman (199G)where a single smoothing parameter is 
used for extracting all principal component functions. The benefit of the flex- 
ibility provided by using multiple smoothing parameter will be illustrated in 
Section 6 with simulated data. 

Remark 1. Invariance of scale transformation of measurements is motivated 
by the consideration that results should remain the same under change of metric 
of the data. Such consideration of invariance might be meaningless if the data 
matrix would be standardized prior to analysis. However, there is no obvious 
way of standardization for functional data. If each column of the data matrix 
is standardized by dividing the corresponding sample standard deviation, for 
example, then the functional nature of the data will be lost. Standardizing 
the whole data matrix by the overall sample standard derivation is not a sound 
operation either, because the sample variance may vary from column to column. 

3. The power algorithm 

The criterion (7) can be minimized by alternating minimization of u and v 
in an iterative algorithm: Fixing v, u = Xv/v T (I m + a Jl)v; fixing u, v = 
(I+a r2)~ 1 X T u/u T u. Separating the scale constants, the algorithm is as follows: 

1. Initialize v. 

2. Repeat until convergence: 

(a) u^Xv, 

(b) v «- (I + ar2)- 1 X T u, 

(c) v<-v/||v|[. 

The initial v can be chosen to be the first right singular vector of X. Step 2. (c) 
forces v to have norm 1, which is somewhat arbitrary and only meant for identi- 
fiability purposes between u and v; any other normalization with different scale 
trade-offs between u and v would work, too. For example, an alternative restric- 
tion on v is v T (I + a S7)v = 1, but its dependence on the smoothing parameter 
a renders it less convenient. When starting from the right singular vector of X, 
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this algorithm converges rather quickly, usually in a few iterations. The algo- 
rithm is a simple variant of the power algorithm for computing eigenvectors. 
The iteration in terms of v alone is v <— (I + afi) _1 X T 'Xv; v <— v/||v||. If 
there is no penalty (a = 0), the algorithm is essentially the power algorithm for 
conventional PC A (e.g., page 409 of Jolliffe, 2002). As discussed below in Sec- 
tion 4, this iterative algorithm naturally facilitates the derivation of an explicit 
cross-validation criterion for smoothing parameter selection. 



4. Choosing the smoothing parameter 

4- 1. Cross-validation and generalized cross-validation 

Our formulation of FPCA suggests a new method for selecting the smoothing 
parameter. Step 2. (b) of the iterative algorithm essentially smooths X T u with 
S(a) = (I+afl)" 1 to update v. This interpretation suggests that CV and GCV 
criteria for selecting the spline tuning parameter a (Green and Silverman, 1994) 
can be adopted to FPCA. The CV score is defined as 



CY(a) Jr [«i-MxHi] 2 (9) 

(l-{S(a)}„) 2 



and the GCV score is defined as 



2 



1 ||(I-S(a))(X r u)|. 
GCV(a) = — (10) 

l-itr{S(a)} W 

In the implementation of either method, we can simply nest CV- or GCV- 
selection of a inside the loop, i.e., in Step 2. (b) of the algorithm. 

The above motivation of CV and GCV is by analogy only. To give a formal 
justification, we show below in Section 4.2 that the CV score given above can 
indeed be derived from the basic idea of cross-validation, deletion of one data 
item at a time. Similarly, the GCV score can also be derived from the basic idea 
of GCV (Craven and Wahba, 1979). The technical difference of our derivations 
compared to those for smoothing splines is that we delete one column of X at 
a time, rather than one matrix entry. 

The connection of column deletion in FPCA with point deletion in spline 
smoothing is clearly seen by considering an extreme case of (7). If X has only 
one row, denoted by y T , then u = u is a scalar. Requiring u to have norm 1 and 
fixing its sign for idcntifiability, it is necessary that u = 1. Then (7) becomes 

||y - mv|| 2 + a u 2 v T r2v = ||y - v|| 2 + a v T J7v, 



which is exactly the penalized least squares criterion for smoothing splines. This 
connection motivates our column deletion procedure that we now elaborate. 
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4-2. Derivation of CV and GCV criteria 

Minimization of (7) with u fixed can be considered as a ridge regression. Con- 
ditional on u with v as the vector of regression coefficients, this ridge regression 
has the following response vector y, design matrix X, and conditional ridge 
penalty matrix Sl v u: 



y = 



v\i 



( xx \ 
x 2 

\ x m J 
-- a Hull 1 





f u 


. 


• ^ 


X = 





u 


. 




^ o 


. 


• u / 



where Xj is the j-th column of X, and where y is of size mn x f and X is of 
size ran x m. Both the design matrix X and the ridge penalty depend on u. It 
follows immediately that the penalized sum of squares (7) is equal to 



y - Xv 



v. 



(11) 



which constitutes a penalized LS problem for v. The associated penalized co- 
variance matrix is 



X T X 



ft 



v \ u 



(u T u) (I + afi), 



and thus the hat matrix of the ridge regression is 



H = X (X T X + J^u)" 1 X T = 



■XSX' 



where S = S(a) = (I + afl)- 1 . 

Consider now the cross-validation that deletes one column of X at a time. 
This corresponds to deleting a block of size n from y at a time. Partition the 
hat matrix H into ra x ra equal-sized blocks where each block corresponds to 



a column of X. Let ^ = (v[ 



(-j) 



v 



) be the v that minimizes (II) 



when the j'-th block of y and the corresponding rows of X are removed. We 
have the following lemma about the leave-out-one-column prediction errors. 

Lemma 1. The j-th leave-out-one-column cross-validated prediction error sum 
of squares is 



T 



(xju) 



u v., 



HI 



(1-7,H| 2 ) 2 



(12) 



where = Sjj/||u|| 2 , and Sjj is the (j,j)-th element of the matrix S = S(a). 

The proof of Lemma 1 is relegated to Appendix A. Note that Vj is the j-th 
element of v = S X T u/u T u and xju is the j-th element of X T u. Note also that 
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7,-IH 



Sjj. Since we condition on u, the first two terms on the right hand 



side of (12) are irrelevant. Averaging the last term in (12) over j, we obtain 



E 

3=1 



u v., 



1 T 

HI X J u 



{S(a)} 



13) 



E 

3=1 



[{(I-S)(X r u)}^ 
||u|P(l-{S(a)} 3 - i ) S 



(13) 



which, ignoring the ||u|| 2 factor in the denominator, is exactly the cross-validation 



criterion given in (9). Replacing 
(l/m)tr{S(a)}, we obtain 



[S(a)]jj in (13) by their average value, 



i||(I-S)(X^u)| 



1 - Mr{S(a)} 



•(l-±tr{S(a)} 



which, ignoring the ||u|| 2 factor, is the generalized cross-validation criterion given 
in (10). 

Remark 2. Both Rice and Silverman (1991) and Silverman (1996) suggest 
cross-validation based on row deletion in X. Row deletion lacks simple com- 
putational shortcuts such as (9); it hence involves actual computation of a large 
number of leave-out-one-row estimates. As a result, CV based on row deletion is 
computationally expensive as confirmed by our numerical studies in Section 6. 
Remark 3. Since the principal components are extracted sequentially, the 
squared errors in the numerators of the CV and GCV criteria vary with the 
principal components. The CV score (9) and the GCV score (10) are defined for 
extracting the first principal component. When each subsequent principal com- 
ponent is extracted, the X matrix in the definition of the CV and GCV scores 
needs to be replaced by the residual matrix after the effects of the previous 
principal components are removed. 



5. Extracting the whole curve of principal component weight 
function 

So far we have focused on the discretized problem, although the functional na- 
ture has been taken into account by regularization with a second-order rough- 
ness penalty. This section introduces an optimization criterion, the minimizer of 
which gives the entire principal component weight function. It also turns out that 
the extracted function is a natural cubic spline that interpolates the weighted 
vector obtained by minimizing (7). Our development relics on some standard 
results from spline smoothing that can be found in Green and Silverman (1994). 

Replacing the discretized vector v with the complete function 7(-), we propose 
to find the estimate of the first principal component weight function by 
minimizing, with respect to Ui and 7(-), the penalized sum of squares, 



n m / n \ r 

ED^-^i)}^ E«* w'wrdt, (w) 
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where a > is a smoothing parameter. The estimated principal component 
function 7(-) is the optimizer of (14) over the class of all functions that satisfy 
J {Y'(t)} 2 dt < co. Similar to (7), the two terms in (14) have the same units, and 
the optimization problem is invariant under scale transformation of the measure- 
ments. However, unlike (7), the whole function is recovered by optimizing (14). 

We now characterize the solution to the optimization problem (14) and show 
that it is closely related to the solution of the discretized problem (7) with 
an appropriately defined penalty matrix. To this end, we define two banded 
matrices, Q and R, as follows. Let hj = tj+i — tj for j = 1, . . . ,m — 1. Let Q be 
the to x (to — 2) matrix with entries qjk, for j = 1, . . . , to and k = 2, . . . , m — 1, 
given by 

q k -i,k = h^_ v q k k = -K-i - K ' Qk+i,k = K 1 
for k = 2, . . . , to — 1, and qjk = for \j — k\ > 2. To simplify the presentation, 
the columns of Q are numbered in a non-standard way, starting at k = 2, so that 
the top left element of Q is qi2- The symmetric matrix R is (to — 2) x (to — 2) 
with elements r^-fc, for j and k running from 2 to (to — 1), given by 

rjj = 3(^-1 + h j) for j = 2, . . . , to - 1, 
r ij+i = r j+i j = \ h 3 for j = 2, . . . , to - 2, 

and rjfc = for |j — k\ > 2. The matrix R is strictly diagonal dominant and 
thus is strictly positive-definite. 

Theorem 1. The 7(-) optimizing (14) is a natural cubic spline with knots attj. 
Let Vj = ^(tj) and v = (v±, . . . , v m ) T . Then v is i/ie optimizer of the discretized 
problem (7) wii/i i/ie penalty matrix = QR~ X Q T . 

The proof of Theorem 1 can be found in Appendix B. According to Theo- 
rem 1, to obtain the entire curve of the principal component weight function, one 
needs to first solve (7) with a penalty matrix CI = QR~ 1 Q T to obtain vi, . . . , v m , 
and then find the natural cubic spline that interpolates (tj ,i>j). Computation 
of the interpolating natural cubic spline 7(-) at any evaluation point t using its 
values at the knots tj is a standard operation. Specifically, let Vj = ~i(tj) and 
Sj = "/"(tj). By the definition of natural cubic spline, the second derivative of 
7 at ti and t rn is zero, so that s\ = s rn = 0. The interpolating natural cubic 
spline is completely determined by its values and second derivatives at each of 
the knots tj according to the following formula: 

= (t-tj)v j+1 + (tj+l -t)Vj 

hj 

-i«-y( t , +I - < ){(i + ^)» j+I + (i + ^)» i } 

for tj<t< tj+i, j = 1, . . . , m— 1; the spline outside [ii, t m ] is obtained by linear 
extrapolation. The vector s = (s2, . ■ ■ , s m _i) T of the second derivatives used 
above can be obtained by s = R~ 1 Q T v. See Chapter 2 of Green and Silverman 
(1994) for more details of efficient computation of interpolating splines. 
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6. Numerical results 

6.1. Call center arrival data example 

We applied the proposed method to the call center arrival data analyzed in 
Shen and Huang (2008). The data recorded the number of calls that got con- 
nected to a call center during every quarter hour between 7:00AM and midnight 
for every weekday between January 1 and October 26 in the year 2003. In to- 
tal, there are 42 whole weeks during the period and each day consists of 68 
quarter hours. Let Nij denote the call volume during the j-th time interval on 
day i. We used the transformed data Xij — -J + 1/4 which together form 
a 210 x 68 matrix. The square-root transformation is used to stabilize variance 
and make the distribution close to normal. The same transformation has been 
used previously by Brown et al. (2005) and Shen and Huang (2008). 

The mean curve from the data (or the column mean vector of X) is quite 
smooth and summarizes the average intraday arrival pattern (Figure 1). It is 
bimodal with a main peak around 11:00AM followed by a second lower peak 
around 2:00PM. We subtracted the mean curve from the data and then ap- 





FlG 1. The call center data. The mean curve and the estimated first three functional principal 
components. MPDC refers to our method of selecting multiple smoothing parameters using 
delete-column CV. SPDR refers to Silverman's method of selecting the single smoothing pa- 
rameters using delete-row CV. 
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plied the penalized sum of squares criterion (14) to sequentially extract the first 
three principal components. The leave-out-one-column CV is used to select the 
smoothing parameter for each principal component respectively. A set of grid 
points of a = and a = 1.5 1 for i = —5, . . . , 25 are examined as candidate 
values. The chosen values are 0.44, 25.63 and 38.44 respectively. The estimated 
smooth principal component weight functions are plotted in Figure 1. The ap- 
plication of the leave-out-one-column GCV leads to qualitatively very similar 
results. 

For comparison, we also followed Silverman (1996)and applied CV with row 
deletion to select a single smoothing parameter. This alternative method selects 
a = 0.13. The estimated principal component functions are plotted in Figure 1. 
We observe that Silverman's method undersmooths the principal component 
functions, while our method of using multiple smoothing parameters performs 
suitable smoothing. The computing times also demonstrate that the CV with 
row deletion is computationally much more expensive than our delete-column 
CV because of its lack of a computational shortcut. To generate the results in 
Figure 1, it took 55 and 2 seconds for the two approaches, respectively, using our 
R program running on a Debian Linux desktop with Intel® Pentium® 4 CPU 
of a clock speed of 2.8 Gigahertz. We used the computational tricks presented 
in Appendix C when implementing both approaches. 

6.2. A simulated data example 

To further understand the difference between the two methods used in Sec- 
tion 6.1, we performed the following simulation study. The data generating 
model is 

Xij =u il v 1 (tj)+Ui2V2(tj) + e i j, i=l,...,n\ j = l,...,m, (15) 

where Un ' N(0, of), u i2 1 ~ ' N(0, erf), and N(0, a 2 ). The parameters 

are chosen as n = m = 101, <r\ = 20, a 2 = 10, a = 4, and the 101 grid points tj 
are equally spaced in [—1,1]. The two underlying functional principal component 
functions are 

Vi(t) = — {£ + sin(7rf)} and i>2(i) = — cos(37rf), 
si s 2 

where s± and s 2 are the normalizing constants that ensure v\ and v 2 to have 
unit norm. We generated one hundred simulated data sets, and estimated the 
first two smooth principal component functions for each data set. We calculated 
the mean squared errors (MSE) over the 101 grid points for each estimated 
principal component function. 

Panels (a) and (b) of Figure 2 are the scatterplots of MSEs for the two 
methods, with each point representing a simulated data set. We observe that 
using single smoothing parameter yields larger MSEs for most simulated data 
sets, especially for the first functional principal component (FPC). The same 
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Fig 2. Comparison of two methods for estimating functional principal components by simu- 
lation study. MPDC refers to our method of selecting multiple smoothing parameters using 
delete-column CV. SPDR refers to Silverman's method of selecting the single smoothing pa- 
rameters using delete-row CV. The reported MSEs equal to the actual values multiplied by 10 4 . 



Table 1 

Summary statistics of the MSE ratios for the two methods in Figure 2. Ql and Q3 refer to 

the lower and upper quartiles. 





FPC 1 






FPC 2 




Ql 


Median Mean 


Q3 


Ql 


Median Mean 


Q3 


SPDR/MPDC 1.17 


1.51 1.64 


2.03 


1.01 


1.08 1.07 


1.20 



message is also evident in Panels (c) and (d) that provide the histograms of the 
ratios of MSEs of the two methods. Table 1 reports some summary statistics of 
the MSE ratio and shows that, by using a single instead of multiple smoothing 
parameters, the mean of the MSE ratio increases by 64% for the first FPC, and 
by 8% for the second FPC. We confirm the observed difference as significant by 
the sign test that gives the p-values that are essentially for both FPCs. We 
also observe that the difference of the two methods for estimating the second 
FPC is not as big as that for the first one. This is because the two methods 
select similar smoothing parameters for the second FPC. As far as computing 
times go, delete-row CV on average needs 116 seconds for one simulation while 
delete-column CV only takes 2 seconds, using the same computer reported in 
Section 6.1. 
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Fig 3. Same as Figure 2, except that a mean function has to be estimated. 



Table 2 

Same as Table 1, except that a mean function has to be estimated. 





FPC 1 






FPC 2 




Ql 


Median Mean 


03 


Ql 


Median Mean 


Q3 


SPDR/MPDC 1.22 


1.53 1.66 


2.08 


1.01 


1.08 1.07 


1.19 



To see the effect of estimating the mean function on extracting the principal 
component functions, a mean function was added to the data generating model 
(15). The mean was then removed by column centering of the data matrix prior 
to applying the FPCA algorithms. Figure 3 and Table 2 present the results after 
removing the mean, and when compared with Figure 2 and Table 1 , suggest that 
the effect of estimating the mean function is not very significant. 

Appendix A: Proof of Lemma 1 

Note that v^ - -?' also solves the ridge regression (11) when the j-th block of y is 
replaced by mr . The j-th block of the fitted equation y = Hy of this latter 
ridge regression reads as 

ur j ' ^ll,,x, • ll„{„r, J '\. 
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Subtracting Xj on both sides of the above identity and observing that J2k ^-jk^-k 
uirj, we obtain 



k 



uv 



ij-Xj + H#{u€j 3) x ; }. 
Therefore, the cross- validated residual for deleting the j-th column of X is 

where 



S 

u 1 u 

Denote w = u.Vj — Xj . Its squared norm is 



|w|| 2 =xjx, -2xJuu, + ||u|| 2 t3 2 



(x» 2 / „_ u T x 7 \ 2 (A.l) 



xjxj rr^ ttt 



u 



Since u T w = ||u|| 2 (u,- — u T Xj/||u|| 2 ), we have that 



^=flHI%-^] ■ (A.2) 



u 



Using the identity 



(i - TjWy 1 = i + - — 7j „ _ uu T 



12 



( — 7") 

we can write the cross- validated residual uu) ~ x j as 



(I - K n )-i(uv 3 Xj ) = I I + T -L. uu - 1 W 



7ill' 

7? 



l-7il|u|| 



r(u T w)u. 



Thus the squared norm of uuj ^ — x 3 - is 

ll-ll 2 + 1 2l \ ||2 (" Tw ) 2 + n 1 112,2 ("Wllull 2 

l- 7j ||u|| 2 (1 - 7i ||u|| 2 ) 2 



= w 



|2 , (u T w) 2 f 2 7 ,||u|| 2 7ll|u|| 4 



l|u|| 2 l(l- 7j ||u|| 2 ) (l-^llull 2 ) 5 



,„ (u T w) 2 r 1 
= w 2 + ^ £_j 1 

11 11 + ||u|P \(l- 7 Ju|| 2 ) 2 
Combining this result with (A.l) and (A.2) we obtain (12). □ 
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Appendix B: Proof of Theorem 1 

Since the natural cubic spline interpolant is the unique minimizer of 
J 7" 2 over all functions that interpolate the data (tj,Vj) (Theorem 2.3 of 
Green and Silverman, 1994), the minimizing function j(-) of (14) is necessar- 
ily a natural cubic spline with knots at the points tj. Therefore in (14) we 
can restrict attention to natural cubic splines with knots at tj. A natural cu- 
bic spline 7(t) that interpolates (tj 7 Vj) is uniquely defined. By Theorem 2.1 of 
Green and Silverman (1994), / {j"(t)} 2 dt = v T f2v, where v = (v\, . . . , v m ) T . 
Let X = (xij) and u = (m, . . . , u n ) T . It follows immediately that the penalized 
sum of squares (14) can be written as 

||X - uv T ||! + au T uv T f2v, 

which is exactly the discretized criterion (7) given in Section 2. □ 

Appendix C: Implementation details 

We provide in this appendix some implementation details of our method. We 
first describe a method of computing smoothed PC using any existing SVD 
software, then discuss efficient computation of the CV/GCV criteria for multiple 
candidate values of a. 

Note that the penalized reconstruction error criterion (7) can be expanded 
as follows: 

||X - uv T ||| + a u T u v T f2v = ||X||| - 2u T Xv + u T u v T (I + a Vt)w. 

Denote S(a) = (I + aO)- 1 , v = S~ 1 / 2 {a)\, and X = XS^ 2 {a). Then, the 
above expression is equivalent to 

11X11! - ||X||! + ||X||! - 2u T Xv + u T uv T v, 

which can be simplified as 

||X|||- ||X||! + ||X-uv T ||!. (A.3) 

For a given smoothing parameter a, the minimizing u and v of (A.3) can be 
easily obtained as the first pair of singular vectors of X = XS 1 / 2 (a), as discussed 
in Section 2.1. Since S 1//2 (a) can be interpreted as a half-smoothing operator, 
the transformed matrix X is obtained by half-smoothing the rows of the original 
data matrix X. After v is obtained as the first right singular vector of X, we 
half-smooth it to obtain the smoothed PC function v = S 1 / 2 (a)v. Thus by 
using existing SVD software, we can avoid directly programming the iterative 
power algorithm. This is convenient when a high level programming language 
is used for coding. 

The eigen decomposition of the symmetric and positive defintc penalty matrix 
n can be written as 

n = TAT T , 
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where T is an orthogonal matrix containing the eigenvectors, and A is a diagonal 
matrix of the eigenvalues. For any value of a, the eigen decompositions of the 
smoothing and half-smoothing matrices are 

S(a)=r(I + aAr 1 r T and S 1/2 (a) = T (I + a A)- 1/2 T T . 

Using the eigen decomposition of S 1 / 2 (a), we see that minimization of ||X — 
u v T |If is equivalent to minimization of ||Xr(I + aA) -1 / 2 — uv T |||. where v = 
r T v. Thus, after obtaining the first right singular vector v of Xr(I + qA) -1 ' 2 . 
we obtain v using v = S 1 / 2 (a)v = T(I + aA) _1 / 2 v. Note that Xr only needs 
to be computed once when v needs to be computed for a set of values of a. 
Now we discuss efficient evaluation of the GCV criterion (10). Since 

I - 8(a) = T{I - (I + aA)" 1 }^, 

the trace that appears in the denominator of (10) equals to 

tr{S(a)} = £ — 1 —, 
and the numerator in (10) is 

||{I - S( a )}(X T u)|| = || {i — (I + a A)^}(xr) T u||. 

Denote w = (Xr) T u. The numerator of the GCV criterion equals to the Eu- 
clidean norm of the shrunken w with the fc-th component shrunken by a factor 
of aAfc/(l + aXk)- When computation of GCV is needed for a set of candidate 
values of a, considerable computing saving is achieved since XT and the eigen 
decomposition of CI only need to be calculated once. 

References 

Brown, L. D., N. Cans, A. Mandelbaum, A. Sakov, H. Shen, S. Zeltyn, and 
L. Zhao. (2005). Statistical analysis of a telephone call center: a queueing- 
science perspective. J. Amer. Statist. Assoc. 100 36-50. MR2166068 

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. 
Numer. Math. 31 377-90. MR0516581 

Eckart, C. and Young, G. (1936). The approximation of one matrix by another 
of lower rank. Psychometrika 1 211-218. 

Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Gen- 
eralized Linear Models: A Roughness Penalty Approach. Chapman & Hall. 
MR1270012 

Hotelling (1933). Analysis of a complex of statistical variables in principal com- 
ponents. J. Educ. Psychol. 24 417-441, 498-520. 

James G. M., Hastie, T. J. and Sugar, C. A. (2000). Principal component models 
for sparse functional data. Biometrika 87 587-602. MR1789811 



J.Z. Huang et al. /Functional PC A 



am 



Jolliffc, I. T. (2002). Principal Component Analysis. Springer- Vcrlag: New York, 

2nd ed. MR2036084 
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. 

Academic Press: London. MR0560319 
Pearson, K. (1901). On lines and planes of closest fit to systems of points in 

space. Phil. Mag. 6 1-7. 
Ramsay, J. O. and Silverman. B. W. (2005). Functional Data Analysis. Springcr- 

Verlag: New York, 2nd cd. MR2168993 
Rice, J. A. and Silverman, B. W. (1991). Estimating the mean and covariance 

structure nonparamctrically when the data are curves. J. Roy. Statist. Soc. 

Ser. B 53 233-243. MR1094283 
Rice, J. A. and Wu, C. O. (2001). Nonparametric mixed effects models for 

unequally sampled noisy curves. Biometrics 57 253-59. MR1833314 
Silverman, B. W. (1996). Smoothed functional principal components analysis 

by choice of norm. Ann. Statist. 24 1-24. MR1389877 
Shcn, H. and Huang, J. Z. (2008). Interday forecasting and intraday updating of 

call center arrivals. Manufacturing and Service Operations Management 10 

391-410. 

Yao, F., Miillcr, H.-G. and Wang, J.-L. (2005a). Functional data analysis for 
sparse longitudinal data. J. Amer. Statist. Assoc. 100 577-90. MR2160561 

Yao, F., Miillcr, H.-G. and Wang, J.-L. (2005b). Functional linear regression 
analysis for longitudinal data. Ann. Statist. 33 2873-903. MR2253106 



